A new era in machine translation research
Abstract
In the 1980s the dominant framework of MT was essentially ‘rule-based’: the linguistics-based approaches of Ariane, METAL, Eurotra, etc., or the knowledge-based approaches at Carnegie Mellon University and elsewhere. New approaches of the 1990s are based on large text corpora: the alignment of bilingual texts, the use of statistical methods and the use of parallel corpora for ‘example-based’ translation. The problems of building large monolingual and bilingual lexical databases and of generating good-quality output have come to the fore. In the past most systems were intended to be general-purpose; now most are designed for specialised applications, e.g. restricted to controlled languages, to a sublanguage or a specific domain, to a particular organisation or to a particular user-type. In addition, the field is widening, with research under way on speech translation, on systems for monolingual users who do not know the target languages, on systems for multilingual generation directly from structured databases, and in general on uses other than those traditionally associated with translation services.

Introduction

At the end of the 1980s, machine translation entered a period of innovation in methodology which has changed the framework of research. What has changed? What was the situation in MT five years ago? Between 1975 and 1988 a large number of operational and commercial systems had appeared: Systran, Logos, Meteo, and in particular many Japanese systems. These systems were based in general either on the ‘direct’ approach to translation or on the method of syntactic transfer. They relied on bilingual dictionaries sufficient for the text domains in question; linguistic analysis was neither particularly deep nor abstract, there was hardly any semantic analysis, and non-linguistic knowledge was not used at all.
As for research, the dominant framework of MT research until the end of the 1980s was the approach based on essentially linguistic rules of various kinds: rules for morphological and syntactic analysis, lexical rules, rules for lexical transfer, rules for syntactic generation, etc. Although the so-called ‘transfer’ systems dominated, e.g. Ariane, Metal, SUSY, Mu and Eurotra, various ‘interlingual’ systems appeared in the later 1980s. Some were still essentially linguistics-oriented (DLT and Rosetta), but others adopted knowledge-based approaches, making use of non-linguistic information about the domains of the texts to be translated. The most notable centre for this research has been Carnegie Mellon University. Nevertheless, these newer knowledge-based systems continued to be essentially rule-based systems, and in any case they remained something of a novelty until almost the end of the decade.

Since 1989 the predominantly rule-based framework has been broken by the emergence of new methods and strategies which are now loosely called ‘corpus-based’ methods. Firstly, a group from IBM published in 1989 the results of experiments on a system based purely on statistical methods. The effectiveness of the method was a considerable surprise to many researchers and has inspired others to experiment with statistical methods of various kinds in subsequent years. Secondly, at the same time certain Japanese groups began to publish preliminary results using methods based on corpora of translation examples, i.e. the approach now generally called ‘example-based’ translation. In both approaches the principal feature is that no syntactic or semantic rules are used in the analysis of texts or in the selection of lexical equivalents. This paper will concentrate on these new developments in MT research.
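The purely statistical approach mentioned above can be summarised in a single equation. In the noisy-channel formulation published by the IBM group, translating a French sentence f amounts to searching for the English sentence e that maximizes the product of a language-model probability and a translation-model probability:

```latex
\hat{e} \;=\; \arg\max_{e} \Pr(e \mid f) \;=\; \arg\max_{e} \Pr(e)\,\Pr(f \mid e)
```

Pr(e) is estimated from a monolingual corpus and Pr(f | e) from an aligned bilingual corpus; both are obtained by counting, not by writing syntactic or semantic rules, which is precisely what surprised the rule-based community.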
It will not describe any one project in detail; projects are mentioned only as examples of trends, and there are many others. For further details and for references to the systems mentioned, see my recent fuller survey. The paper will also say almost nothing about methods already well established by the end of the 1980s. Furthermore, nothing will be said about the use of commercial systems or the development of aids for translators. The subject is exclusively the development of new methods in MT research. Of course, many of the methods are still experimental and have not yet been tested on a large scale. Nevertheless, the trends are real; since 1989 MT has experienced a reorientation of its methodology sufficient to justify calling the 1990s a genuinely ‘new era’.

Rule-based systems

Before describing these new corpus-based developments in detail I shall begin with rule-based approaches, since here too there have been important theoretical and methodological developments. Five or six years ago saw the end of two of the most significant transfer-based projects: the Ariane project at Grenoble University and the Eurotra project of the European Communities. These systems exemplified the typical features of so-called ‘second-generation’ systems: batch processing with post-editing and no interactive components; an essentially syntax-oriented and stratificational design, with three stages of analysis, transfer and synthesis, and with analysis and generation passing through a series of distinct levels (morphology, syntax and semantics); relatively abstract interface representations in the form of labelled trees; transduction rules for changing trees from one level to another; and little use of pragmatic and discourse information. Nevertheless, these projects do “live on” to a certain extent in the Eurolang project based at SITE, a French company previously connected with the Ariane project.
The project involves collaboration with the German company Siemens-Nixdorf and its Metal system, and it is benefiting from experience with Eurotra. The first product of Eurolang is, however, not an MT system as such but a translator’s workstation, the Optimizer. Other transfer-based systems continue in the present decade: for example, the already mentioned commercial system Metal, and the major research at various IBM centres on the LMT (Logic-programming MT) system.

The beginning of this decade also saw the end of some rule-based ‘interlingual’ research systems: the DLT project in Utrecht, based on Esperanto as interlingua, and the Rosetta project at Philips, which explored an isomorphic approach to constructing interlingual representations and the integration of Montague semantics. However, major interlingual projects continue to thrive, indeed with even more vigour, particularly in the knowledge-based approach at Carnegie Mellon University. The distinctive features are familiar: a neutral intermediary language for representing the meanings of texts (the interlingua) and knowledge databases related to the domain of the texts to be translated. Several models have been developed over the years, and in 1992 a collaborative project with the Caterpillar company was announced, with the aim of creating a large-scale high-quality system for technical manuals in the specific domain of heavy earth-moving equipment. Other interlingual systems include the ULTRA system at New Mexico State University and the UNITRAN system, based on the linguistic theory of Principles and Parameters. There is also the Pangloss project, an interlingual system restricted to the vocabulary of mergers and acquisitions, a collaborative project involving experts from the universities of Southern California, New Mexico State and Carnegie Mellon.
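The architectural point behind the interlingual approach can be made concrete with a minimal sketch: analysis produces a single language-neutral meaning representation, and each target language needs only its own generator rather than a bilingual transfer module per language pair. Everything below — the frame structure, the toy sentence and the lexical choices — is invented for illustration and is not taken from any of the systems named above.

```python
# Toy illustration of the interlingual architecture: one language-neutral
# frame, independent per-language generators. Adding a new target language
# means adding one generator, not one transfer module per language pair.

def analyse(sentence: str) -> dict:
    """Map one toy English sentence onto a frame-like interlingua."""
    assert sentence == "The pump is broken", "outside toy coverage"
    return {"predicate": "BE-BROKEN", "theme": "PUMP", "tense": "PRESENT"}

def generate_fr(frame: dict) -> str:
    """French generator: consults its own monolingual lexicon only."""
    lex = {"PUMP": "la pompe"}
    if frame["predicate"] == "BE-BROKEN" and frame["tense"] == "PRESENT":
        return f"{lex[frame['theme']]} est cassée"
    raise ValueError("no generation rule")

def generate_de(frame: dict) -> str:
    """German generator, entirely independent of the French one."""
    lex = {"PUMP": "die Pumpe"}
    if frame["predicate"] == "BE-BROKEN" and frame["tense"] == "PRESENT":
        return f"{lex[frame['theme']]} ist kaputt"
    raise ValueError("no generation rule")

frame = analyse("The pump is broken")
print(generate_fr(frame))  # la pompe est cassée
print(generate_de(frame))  # die Pumpe ist kaputt
```

A knowledge-based system such as CMU’s adds domain knowledge bases to constrain the frames; this sketch shows only the decoupling of analysis from generation.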
Pangloss is itself one of three MT projects supported by DARPA, the others being the IBM statistics-based project (see below) and a system being developed by Dragon Systems, a company which has been particularly successful in speech research but with no previous experience in MT.

The ‘lexicalist’ tendency

A characteristic feature of rule-based systems is the transformation or mapping of labelled tree representations. For example (Fig. 1), in Eurotra a series of tree transductions was proposed: from a morphological tree into a syntactic tree, from a syntactic tree into a semantic tree, from an interface tree of the source language into an equivalent target-language tree, and so forth. Transduction rules require the satisfaction of precise conditions: a tree must have a specific structure and contain particular lexical items or specific syntactic or semantic features. In addition, every tree is tested by formation rules; in effect, a ‘grammar’ confirms the acceptability of its structure and the relationships it represents. A tree is rejected if it does not conform to the grammatical rules of the level in question: morphological, syntactic, semantic, etc. Grammars and transduction rules specify the constraints which determine the possibility of transfer from one level to another and hence, in the end, the transfer of a source-language text into a target-language text.

Fig. 1. Analysis as a sequence of levels: source text → Grammar rules G1 → Representation L1 → Transformation rules T1/2 → Grammar rules G2 → Representation L2 → Transformation rules T2/3 → Grammar rules G3 → Representation L3.
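The transduction mechanism described above — a rule that fires only when its structural condition holds, followed by a formation-rule check at the target level — can be sketched as follows. The tree encoding, the single rule and the single formation rule are invented for illustration; they are not Eurotra’s actual rule formats.

```python
# Minimal sketch of rule-based tree transduction: a rule applies only if the
# input tree satisfies a precise structural condition, and the output tree is
# then validated by the formation rules ("grammar") of the target level.

def node(label, *children):
    """A labelled tree as a plain dict."""
    return {"label": label, "children": list(children)}

def np_to_entity(tree):
    """Transduction rule: a syntactic NP [Det N] maps to a semantic ENTITY
    node headed by the noun. Returns None if the condition is not met."""
    kids = tree["children"]
    if (tree["label"] == "NP" and len(kids) == 2
            and kids[0]["label"] == "Det" and kids[1]["label"] == "N"):
        return node("ENTITY", kids[1])  # keep the noun, drop the determiner
    return None  # structural condition not satisfied: rule does not apply

def well_formed_semantic(tree):
    """Formation rule for the semantic level: an ENTITY node must dominate
    exactly one N. Trees failing this test are rejected."""
    return (tree["label"] == "ENTITY" and len(tree["children"]) == 1
            and tree["children"][0]["label"] == "N")

syntactic = node("NP", node("Det"), node("N"))
semantic = np_to_entity(syntactic)            # rule fires
assert semantic is not None
assert well_formed_semantic(semantic)         # grammar accepts the result
assert np_to_entity(node("VP")) is None       # condition fails, no output
```

A full system composes many such rules per level, which is exactly why the cascaded architecture of Fig. 1 becomes expensive to write and maintain by hand.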